Clustering blog entries based on the hybrid document model enhanced by the extended anchor texts and co-referencing links
Abstract
In this paper, we propose a document vector space model in which the weights of noun terms vary with their positions within the texts of blog entries returned as search results. We extend “extended anchor texts” (i.e., the extra text surrounding anchor texts) with an exponential potential, so that the weight of a noun term decreases exponentially as the distance between the term and the link increases. To cluster blog entries returned as search results, we use a hybrid vector space model that takes into account both the texts, including extended anchor texts, and the co-references of Web pages through links in the blog entries. We evaluate the effects of our scheme on clustering blog search results.
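The weighting and the hybrid similarity described in the abstract could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation. The abstract gives no formulas, so the decay form `exp(-decay * distance)`, the token-index distance, the linear mix controlled by `alpha`, and all function and parameter names are hypothetical. Noun filtering is omitted, and every token is treated as a candidate term.

```python
import math
from collections import defaultdict


def anchor_weighted_vector(tokens, link_positions, decay=0.5):
    """Build a term vector whose weights fall off exponentially with the
    distance (in token positions) to the nearest link.

    Assumption: weight = exp(-decay * distance). The abstract only states
    that the decrease is exponential.
    """
    vec = defaultdict(float)
    for i, term in enumerate(tokens):
        if link_positions:
            distance = min(abs(i - p) for p in link_positions)
            vec[term] += math.exp(-decay * distance)
        else:
            # Assumption: with no link in the entry, fall back to plain counts.
            vec[term] += 1.0
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_similarity(text_a, text_b, links_a, links_b, alpha=0.7):
    """Mix text similarity with link co-reference similarity.

    Assumption: a linear combination weighted by alpha; linked URLs are
    represented as binary vectors, so entries that reference the same
    pages score higher on the link part.
    """
    link_vec_a = {url: 1.0 for url in links_a}
    link_vec_b = {url: 1.0 for url in links_b}
    text_sim = cosine(text_a, text_b)
    link_sim = cosine(link_vec_a, link_vec_b)
    return alpha * text_sim + (1.0 - alpha) * link_sim


# Hypothetical usage: the link sits at token index 2 in both entries.
entry_a = anchor_weighted_vector(["new", "camera", "review", "lens"], [2])
entry_b = anchor_weighted_vector(["camera", "lens", "review", "price"], [2])
score = hybrid_similarity(entry_a, entry_b,
                          {"http://example.com/cam"},
                          {"http://example.com/cam"})
```

The resulting pairwise scores could then be passed to any standard clustering routine. The abstract does not say which clustering algorithm is used.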
Related resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text motivated researchers to use...
A probabilistic topic model based on local word relations in overlapping windows
A probabilistic topic model assumes that documents are generated through a process involving topics, and then tries to reverse this process to extract the topics from the given documents. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...
Clustering web documents using co-citation, coupling, incoming, and outgoing hyperlinks: a comparative performance analysis of algorithms
Querying search engines with the keyword "jaguars" returns results as diverse as web sites about cars, computer games, attack planes, American football, and animals. More and more search engines offer options to organize query results by categories or, given a document, to return a list of links to topically related documents. While information retrieval traditionally defines similarity of docu...
Modeling of Nanofiltration for Concentrated Electrolyte Solutions using Linearized Transport Pore Model
In this study, the linearized transport pore model (LTPM) is applied to model the nanofiltration (NF) membrane separation process. The approach is based on the modified extended Nernst-Planck equation, enhanced by Debye-Hückel theory to account for variations in the activity coefficient, especially at high salt concentrations. Rejection of single-salt (NaCl) electrolyte is inve...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained by the k-means method. Text clustering has many applications in various fields of natural language processing. Much research has been done on clustering English documents, which raises the question of whether those results extend to other languages. Since the goal of document clustering is grouping of docum...